library(ggplot2)
library(tidyverse)
library(flextable) # Doing normal looking tables
library(plotly) # For the 3D scatterplot
library(gridExtra) # grids of ggplots
library(grid)
library(viridis) # for colours
I have run optimizations both on my local device and on the HPC. The goal of this script is to compare their outputs and to see whether there is a clear optimum (maximizing survival) for any threshold combination.
Load the files needed and give them recognisable names.
# Model 1.3.1 HPC optimization output
setwd("C:/Users/c0070955/OneDrive - Newcastle University/1-PHD-project/Modelling/R/Model_output/HPC/131/2023-08-26")
load('opt_outcome_concat_HPC_ 131 .Rda')
HPC_131<-HL_df
rm(HL_df)
HPC_131<-HPC_131 %>% mutate_all(as.numeric)
# Model 1.3.1 Local optimization output
setwd("C:/Users/c0070955/OneDrive - Newcastle University/1-PHD-project/Modelling/R/Model_output/Results_June23/MOD_1_3_1/3-Optimization/2023-06-22_start")
load('2023-08-01_12_05_38opt_out131d30N1000dayh8num_th50.Rda')
local_131<-outcome_opt_df
rm(outcome_opt_df)
local_131<-na.omit(local_131)
local_131$th_num<-1:nrow(local_131)
colnames(local_131)<-c('mean', 'sd', 'th1', 'th2', 'th3', 'th_num')
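Many of the comparisons below assume that both runs evaluated the same threshold grid. A quick, hypothetical sanity check of that assumption (column names follow the renaming above; note that `na.omit()` may have dropped rows from the local run, which this check would surface):

```r
# Check that the two runs share the same threshold grid once ordered by
# th_num; a mismatch here (e.g. rows dropped by na.omit) would undermine
# any row-by-row comparison later on
chk_hpc <- HPC_131[order(HPC_131$th_num), c('th1', 'th2', 'th3')]
chk_loc <- local_131[order(local_131$th_num), c('th1', 'th2', 'th3')]
all.equal(chk_hpc, chk_loc, check.attributes = FALSE)  # TRUE if grids match
```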
First, determine which combination of thresholds was optimal in the HPC run and in the local run.
# For hpc
HL_opt<-HPC_131[(which.max(HPC_131$mean)),]
# bind together
HL_opt<-rbind(HL_opt, (local_131[(which.max(local_131$mean)),]))
HL_opt$type[1]<-'HPC'
HL_opt$type[2]<-'Local'
HL_opt %>% flextable()
| mean | sd | th1 | th2 | th3 | th_num | type |
|---|---|---|---|---|---|---|
| 2,693.686 | 3,064.398 | 0.06530612 | 0.10612245 | 0.3836735 | 16,302 | HPC |
| 2,676.947 | 3,334.457 | 0.01632653 | 0.04081633 | 0.3102041 | 8,449 | Local |
These are different. To check whether the two optimizations nevertheless produced a similar output overall, plot the 3D image of all half-life survival values. Note that I have set the colourscales so the two plots are comparable.
# Plot the HPC outcome
hpc_plot<-plot_ly(HPC_131, x = ~th1, y = ~th2, z = ~th3,
marker=list(color = ~mean, cmin=1000, cmax=3000, colorscale='Viridis', showscale=TRUE),
text=~paste('Mean HL:', mean)) %>%
#add_markers(color=~mean, marker=list()) %>%
layout(scene = list(xaxis = list(range=c(0, 0.4),title = 'TH1'),
yaxis = list(range=c(0, 0.4),title = 'TH2'),
zaxis = list(range=c(0, 0.4),title = 'TH3')),
title = list(text='1.3.1 Mean Halflife (in timesteps) - HPC', y=0.95))
# Plot the local outcome
local_plot<-plot_ly(local_131, x = ~th1, y = ~th2, z = ~th3,
marker=list(color = ~mean, cmin=1000, cmax=3000, colorscale='Viridis', showscale=TRUE),
text=~paste('Mean HL:', mean)) %>%
#add_markers(color=~mean, marker=list()) %>%
layout(scene = list(xaxis = list(range=c(0, 0.4),title = 'TH1'),
yaxis = list(range=c(0, 0.4),title = 'TH2'),
zaxis = list(range=c(0, 0.4),title = 'TH3')),
title = list(text='1.3.1 Mean Halflife (in timesteps) - Local', y=0.95))
hpc_plot
local_plot
The graphs look superficially the same. Now, I will run the environment functions for these specific threshold values and compare the results. I run the models with the standard settings of days=30, N=1000 and daylight_h=8.
# Retrieve the function files that are needed
setwd("C:/Local_R/BiPhD-ABM/May23")
source('MOD_1_FuncSource.R')
source('ModelSource.R')
# run the model with the standard settings for the HPC optimum
env_func_1_3_1_par(days = 30, N= 1000, th_forage_sc1 = HL_opt$th1[1], th_forage_sc2 = HL_opt$th2[1], th_forage_sc3 = HL_opt$th3[1], daylight_h = 8, modelType = 131)
HPC_env_out<-output_env_func
# And for the local
env_func_1_3_1_par(days = 30, N= 1000, th_forage_sc1 = HL_opt$th1[2], th_forage_sc2 = HL_opt$th2[2], th_forage_sc3 = HL_opt$th3[2], daylight_h = 8, modelType = 131)
loc_env_out<-output_env_func
After running, I compare the survival curves of each of the threshold sets across all environments.
# add a column
for (i in 1:18){
HPC_env_out[[2]][[i]]$Type<-rep("HPC")
HPC_env_out[[2]][[i]]$env<-rep(paste(i))
loc_env_out[[2]][[i]]$Type<-rep("loc")
loc_env_out[[2]][[i]]$env<-rep(paste(i))
}
# merge them
output<-map2_dfr(HPC_env_out[[2]], loc_env_out[[2]], bind_rows)
# subset survival
output_surv<-subset(output, output$id=="alive")
output_surv$env<-as.numeric(output_surv$env)
# use bare column names in aes() rather than output_surv$... -- the
# $-form is fragile in combination with faceting
ggplot(output_surv, aes(x=timestep, y=value, color=Type)) +
  geom_line(size=1) +
  scale_color_brewer(palette='Accent') +
  facet_wrap(~env, ncol=3)
As Tom already suggested, the lines differ in environments 12 and 13. Note that this is for two different parameter combinations. The next step is to find the equivalent of the HPC optimum in the local run and vice versa, and compare those. I’m also realising I should probably retrieve the actual runs, instead of running the models again.
# For the HPC retrieve the specific run : TH 16302
setwd("C:/Users/c0070955/OneDrive - Newcastle University/1-PHD-project/Modelling/R/Model_output/HPC/131/2023-08-26/09-batch/")
load("outcome_1_3_1_HPC_th 16302 .Rda")
HPC_opt_run<-env_results
# For the HPC but from the local optimal run: TH 8449
setwd("C:/Users/c0070955/OneDrive - Newcastle University/1-PHD-project/Modelling/R/Model_output/HPC/131/2023-08-26/05-batch/")
load("outcome_1_3_1_HPC_th 8449 .Rda")
HPC_loc_opt_run<-env_results
# Retrieve the optimal run for the local: TH 8449
setwd("C:/Users/c0070955/OneDrive - Newcastle University/1-PHD-project/Modelling/R/Model_output/Results_June23/MOD_1_3_1/3-Optimization/2023-06-22_start")
load("opt_run_env.Rda")
loc_opt_run<-output_env_func
# And retrieve the local run with the HPC optimum: TH 16302
load("opt_hpc_run_env.Rda")
loc_hpc_opt_run<-output_env_func
# add a column
for (i in 1:18){
HPC_opt_run[[2]][[i]]$Type<-rep("HPC_opt_HPC_run")
HPC_opt_run[[2]][[i]]$env<-rep(paste(i))
HPC_loc_opt_run[[2]][[i]]$Type<-rep("loc_opt_HPC_run")
HPC_loc_opt_run[[2]][[i]]$env<-rep(paste(i))
loc_opt_run[[2]][[i]]$Type<-rep("loc_opt_loc_run")
loc_opt_run[[2]][[i]]$env<-rep(paste(i))
loc_hpc_opt_run[[2]][[i]]$Type<-rep("hpc_opt_loc_run")
loc_hpc_opt_run[[2]][[i]]$env<-rep(paste(i))
}
# merge them
a<-do.call('rbind', HPC_opt_run[[2]])
b<-do.call('rbind',HPC_loc_opt_run[[2]])
c<-do.call('rbind', loc_opt_run[[2]])
d<-do.call('rbind', loc_hpc_opt_run[[2]])
output_4<-rbind(a,b,c,d)
# subset survival
output_4surv<-subset(output_4, output_4$id=="alive")
output_4surv$env<-as.numeric(output_4surv$env)
# plot
# again, use bare column names in aes() rather than output_4surv$...
ggplot(output_4surv, aes(x=timestep, y=value, color=Type)) +
  geom_line(size=0.75) +
  scale_color_manual(values=c("#FFDB6D", "#C4961A", "#C3D7A4", "#52854C")) +
  facet_wrap(~env, ncol=3)
The yellow lines (both referring to the threshold combination that is optimal according to the HPC run) and the green lines (referring to the optimum combinations according to the local run) are similar. This indicates that for these particular combinations of thresholds, the survival curves are the same in the HPC optimization and in the local optimization.
In environments 12 and 13, the set of green lines differs from the set of yellow/brown lines, indicating that the local optimum combination doesn’t perform as well as the combination that the HPC gave. This confuses me, because, if that combination works this well, why was it not indicated as an optimum in the local run? –> I think this has to do with the general mean (across all environments).
The next step is to compare those means. They are as follows:
row1<-c('HPC - 16302', 'local', 2411.853)
row2<-c('HPC - 16302', 'hpc', 2693.68)
row3<-c('Local - 8449', 'local', 2676.94)
row4<-c('Local - 8449', 'hpc', 2253.319)
hl_table<-as.data.frame(rbind(row1, row2, row3, row4))
colnames(hl_table)<-c('TH comb', 'run type', 'mean HL')
hl_table%>%flextable()
| TH comb | run type | mean HL |
|---|---|---|
| HPC - 16302 | local | 2411.853 |
| HPC - 16302 | hpc | 2693.68 |
| Local - 8449 | local | 2676.94 |
| Local - 8449 | hpc | 2253.319 |
This shows that there is nothing ‘faulty’ with the mean half-lives: for the HPC run the HPC combination is best, and for the local run the local combination is best. The next step is to get an idea of the separate half-lives in each of the environments for both the local and the HPC optimum.
# call the halflife function (copied from the function source file)
t_halflife_func<-function(halflife_input){
  # list for the t_HL values
  t_HL_list<-list()
  # list for general fit summaries
  fit_sum_list<-list()
  for (i in 1:length(halflife_input)){
    # Create the dataframe you'll be dealing with
    df<-subset(halflife_input[[i]], halflife_input[[i]]$id=='alive')
    # clean up the dataframe
    df$timestep<-as.numeric(df$timestep)
    df<-df[,2:3]
    colnames(df)<-c('y', 't')
    # Fit a basic exponential decay curve; starting values need to be given.
    # Note: the iteration cap must be passed via the control argument --
    # calling nls.control() on its own line has no effect on the fit.
    fit<-nls(y ~ a*exp(-b*t), data=df,
             start=list(a=1, b=0.0000001),
             control=nls.control(maxiter = 100))
    # pull out the summary --> this has the estimated values for a and b in it
    sum_fit<-summary(fit)
    # put in the list
    fit_sum_list[[i]]<-sum_fit$parameters
    # Now, at which timestep does the curve drop to half its starting value?
    # Set the current a & b
    cur_a<-fit_sum_list[[i]][1]
    cur_b<-fit_sum_list[[i]][2]
    # set the halflife
    y_halflife<-0.5
    # now calculate the timestep at which this will occur
    t_halflife<-(-(log(y_halflife/cur_a)/cur_b))
    # calculate y from there (just to check)
    #ytest<-(cur_a*exp(-cur_b*t_halflife))
    # put in the list
    t_HL_list[[i]]<-t_halflife
  }
  return(t_HL_list)
}
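Before trusting the per-environment numbers, it is worth confirming the closed-form step inside `t_halflife_func`: solving y = a·exp(−b·t) for y = 0.5 gives t = −log(0.5/a)/b. A quick check with synthetic `a` and `b` (not fitted values):

```r
# Sanity check of the half-life algebra used in t_halflife_func,
# with made-up parameter values a and b
a <- 1
b <- 0.002
t_half <- -(log(0.5 / a) / b)    # same expression as in the function
# plugging t_half back into the decay curve must return 0.5
stopifnot(isTRUE(all.equal(a * exp(-b * t_half), 0.5)))
t_half                           # log(2) / 0.002, roughly 346.6 timesteps
```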
# Flatten the returned list into a one-column data frame (the column is
# named V1, which the geom_vline calls below rely on)
HPC_halflife_perEnv<-data.frame(V1=unlist(t_halflife_func(halflife_input = HPC_opt_run[[2]])))
HPC_halflife_perEnv$env<-1:18
Local_halflife_perEnv<-data.frame(V1=unlist(t_halflife_func(halflife_input = loc_opt_run[[2]])))
Local_halflife_perEnv$env<-1:18
halflife_perEnv<-cbind(HPC_halflife_perEnv$V1, Local_halflife_perEnv)
colnames(halflife_perEnv)<-c('HPC', 'Local', 'Env')
halflife_perEnv%>%flextable()
| HPC | Local | Env |
|---|---|---|
| 185.9017 | 187.2570 | 1 |
| 375.9614 | 344.9469 | 2 |
| 4,868.0272 | 5,669.0224 | 3 |
| 138.5749 | 141.8283 | 4 |
| 190.8635 | 192.7407 | 5 |
| 335.5135 | 338.5815 | 6 |
| 379.9738 | 328.3885 | 7 |
| 6,686.2728 | 7,236.6943 | 8 |
| 4,841.1304 | 6,288.2347 | 9 |
| 169.4895 | 175.2049 | 10 |
| 279.0309 | 283.7875 | 11 |
| 1,390.9150 | 1,061.5352 | 12 |
| 7,032.0739 | 1,380.0802 | 13 |
| 7,250.1093 | 8,239.7951 | 14 |
| 7,419.1828 | 7,957.4920 | 15 |
| 217.4633 | 219.9062 | 16 |
| 521.1419 | 464.9984 | 17 |
| 6,204.7309 | 7,674.5436 | 18 |
# graph: repeat the survival curves and overlay the fitted half-lives
ggp<-ggplot(output_4surv, aes(x=timestep, y=value, color=Type)) +
  geom_line(size=0.75) +
  scale_color_manual(values=c("#FFDB6D", "#C4961A", "#C3D7A4", "#52854C")) +
  facet_wrap(~env, ncol=3) +
  geom_vline(data=HPC_halflife_perEnv, aes(xintercept=V1), color='#daa520') +
  geom_vline(data=Local_halflife_perEnv, aes(xintercept=V1), color='#2e8b57')
ggp
This shows that, indeed, the local optimum combination has some higher half-lives in other environments, which don’t stand out as much because of the shallower lines. The next steps are:

- Check what the distribution of HL values looks like for each of the optimization runs
- Check how the optimization runs correlate with each other
- Check where the optimal values fall within this correlation
I also spoke to Tom about how to proceed more generally and we discussed:
Option 1: we move away from the mean halflife across all environments. This would mean removing the middle environments (as done for the ASAB conference). With only 8 environments left, I can explore how optima for those 8 environments specifically would develop and act under different circumstances, creating different ‘evolutionary trajectories/pathways’. –> We might need to do this regardless, but we first need to check whether this will actually solve the issue of the different optimization outcomes. For this, I need to check the correlation between different runs at the environment level (not just at the mean level).
Option 2: actually show and go into this variation. Could we select a group of threshold combinations that are correlated in, say, 2 runs and use these? We could then repeat these 20 variables 10x to actually home in on an optimum. I’m still not completely sure how we would do this for other models. Do we just run the optimization twice? How do we know we’re not missing something out?
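The selection step in Option 2 could be sketched as below. Assumptions: both runs evaluated the same threshold grid (so `th_num` identifies the same combination in each), and the cutoff of 2500 is an arbitrary value for illustration only.

```r
# Sketch of Option 2's selection step: keep the threshold combinations
# whose mean HL is high in *both* optimization runs
both <- merge(HPC_131, local_131, by = 'th_num', suffixes = c('_hpc', '_loc'))
cutoff <- 2500                       # assumed value, for illustration
top_set <- subset(both, mean_hpc > cutoff & mean_loc > cutoff)
nrow(top_set)                        # size of the candidate set to re-run
```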
par(mfrow=c(1,2))
hist(HPC_131$mean, ylim=c(0,5000), main='HPC', xlab='Mean HL', col='#daa520')
hist(local_131$mean, ylim=c(0,5000), main='Local', xlab = 'Mean HL', col='#2e8b57')
So only a small number of threshold combinations reach the high HL values. The next step is to check whether these are the same combinations in the HPC and in the local run.
# First, make sure to order them
HPC_order<-HPC_131[order(HPC_131$th_num),]
local_order<-local_131[order(local_131$th_num),]
# Use %in% rather than ==: comparing the vector against c(2693.686, 2253.319)
# with == would recycle the two values and mis-flag points
highlight<-HPC_order$mean %in% c(2693.686, 2253.319)
plot(HPC_order$mean, local_order$mean,
     main='Relationship HPC mean-HL vs local mean-HL',
     ylab='Mean HL local', xlab='Mean HL HPC',
     col=ifelse(highlight, 'red', 'blue'),
     pch=ifelse(highlight, 50, 10))
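To put numbers on this scatterplot (and on steps 2 and 3 of the list above), a possible follow-up, assuming `HPC_order` and `local_order` are aligned by `th_num` as constructed here:

```r
# Quantify how well the two optimization runs agree across all
# threshold combinations, on raw and rank scales
cor(HPC_order$mean, local_order$mean)                       # linear agreement
cor(HPC_order$mean, local_order$mean, method = 'spearman')  # rank agreement
# where does each run's optimum rank in the *other* run?
rank(-local_order$mean)[which.max(HPC_order$mean)]
rank(-HPC_order$mean)[which.max(local_order$mean)]
```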